Robust Web Data Extraction with XML Path Expressions

Authors

  • Jussi Myllymaki
  • Jared Jackson
Abstract

Automated extraction of structured Web data has attracted considerable interest in both academia and industry. A particularly promising approach is to employ XML technologies to translate semi-structured HTML documents into “pure” XML documents. In this approach, HTML documents are first normalized into XHTML and then mapped to the desired XML application format by using XML path expressions and regular expressions. In this paper we describe a methodology for creating XML path (XPath) expressions that are capable of extracting data from virtually any HTML page, while placing an emphasis on the persistent integrity of these expressions. This robustness is critical given the vulnerability of extraction technologies to the continually changing content, structure, and formatting of pages on the Web. We define categories of extraction rules in terms of their dependence on content, structural, or formatting features, and provide practical tips on how to create dependable data extraction patterns for the Web.
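
The pipeline outlined in the abstract (normalized XHTML, then path expressions, then regular expressions) can be illustrated with a minimal sketch. It is not the authors' implementation: the XHTML fragment, the `extract_quotes` function, and the `PRICE` pattern are hypothetical names invented for this example. The sketch assumes the page has already been normalized into well-formed XHTML without a namespace, and it relies on the limited XPath subset supported by Python's standard `xml.etree.ElementTree` module. The table is located through a content feature (a header cell reading "Symbol") rather than through its absolute position in the page.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, already-normalized XHTML fragment (not taken from the paper).
XHTML = """
<html>
  <body>
    <table>
      <tr><th>Symbol</th><th>Price</th></tr>
      <tr><td>ACME</td><td>USD 12.50</td></tr>
      <tr><td>INIT</td><td>USD 7.25</td></tr>
    </table>
  </body>
</html>
""".strip()

# Regular expression that isolates the numeric part of a price cell.
PRICE = re.compile(r"(\d+\.\d+)")


def extract_quotes(xhtml):
    """Return {symbol: price} from the table whose header row contains 'Symbol'."""
    root = ET.fromstring(xhtml)
    quotes = {}
    for table in root.iter("table"):
        # Content-based anchor: skip tables without a 'Symbol' header cell.
        if table.find("tr[th='Symbol']") is None:
            continue
        # Data rows are the rows that contain <td> cells.
        for row in table.findall("tr[td]"):
            cells = row.findall("td")
            if len(cells) < 2:
                continue
            match = PRICE.search(cells[1].text or "")
            if match:
                quotes[cells[0].text] = float(match.group(1))
    return quotes


if __name__ == "__main__":
    print(extract_quotes(XHTML))
```

Because the table is found by its header text and not by a fixed path such as the first table under `body`, the same rule keeps working in this sketch if the table is later wrapped in another element or preceded by additional tables.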

Related resources

Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. In our extraction algorithm, we are inspired by the way a human user scans the page content for specific data. In particular, we use text fea...

Towards Building XML Statistics for the Hidden Web

There is currently a lot of interest in developing Internet query processors that can pose elaborate queries on XML data on the Web. Such query processors can query data sources that have static XML files, but they should also be able to query “hidden Web” data sources that export an XML view of data stored in a database. To optimize queries that involve these hidden Web data sources, we need t...

Repairing Inconsistent Merged XML Data

XML is rapidly becoming one of the most widely adopted standards for information representation and interchange over the Internet. With the proliferation of mobile communication devices such as palmtop computers in recent years, a growing number of web applications generate tremendous amounts of XML data transmitted via the Internet. We therefore need to investigate an effective me...

Report on the Eighth International Workshop on Knowledge Representation Meets Databases (KRDB), September 15, 2001

The Eighth International Workshop on Knowledge Representation Meets Databases (KRDB) was held at the Pontificia Università Urbaniana, in Rome, right after VLDB 2001. KRDB was initiated in 1994 to provide an opportunity for researchers and practitioners from the two areas to exchange ideas and results. This year's focus was on Modeling, Querying and Managing Semistructured Data. The one day progr...

On-Line Selectivity Estimation for XML Path Expressions using Markov Histograms

The extensible mark-up language (XML) is gaining widespread use as a format for data exchange and storage on the World Wide Web. Queries over XML data require accurate selectivity estimation of path expressions in order to optimize query execution plans. Selectivity estimation of XML path expressions is usually done based on summary statistics about the structure of the underlying XML repository...

Publication date: 2002